1 Introduction

1.1 Context

Analysis of a metabolomics data set (i.e. samples times variables table of peak or bucket intensities generated by preprocessing tools such as XCMS) is aimed at mining the data (e.g. trends and outliers) and detecting features of predictive value (biomarker discovery). It comprises multiple steps including:

  • Post-processing (quality control, normalization and/or transformation of intensities)

  • Univariate hypothesis testing

  • Multivariate modeling

The metabolis package addresses the two first steps, and can be combined with other packages for multivariate modeling, such as the ropls and biosigner Bioconductor packages, as described below.

1.2 Methods

Methods from the metabolis, ropls, and biosigner packages for the analysis of metabolomics datasets; specific parameter values used for the sacurine dataset described in the Hands-on part below are provided as examples.

Methods from the metabolis, ropls, and biosigner packages for the analysis of metabolomics datasets; specific parameter values used for the sacurine dataset described in the ‘Hands-on’ part below are provided as examples.

1.3 Formats

1.3.1 3 tabular file format used for import/export

Input (i.e. preprocessed) data consists of a ‘samples times variables’ matrix of intensities (datMatrix numeric matrix), in addition to sample and variable metadata (sampleMetadata and variableMetadata data frames). Theses 3 tables can be conveniently imported to/exported from R as tabular files:

3 table format used as input/output from the data analysis workflow.

3 table format used as input/output from the data analysis workflow.

1.3.2 ExpressionSet class used within the data analysis workflow

Within the R workflow, the ExpressionSet class perfectly handles these 3 tables (for additional information about ExpressionSet class, see the ‘An introduction to Biobase and ExpressionSets’ documentation from the Biobase package). The following table describes useful Biobase methods for the handling of ExpressionSet objects:

Biobase methods Description
exprs(eset) ‘variable times samples’ numeric matrix - dataMatrix
pData(eset) sample metadata data frame - sampleMetadata
fData(eset) variable metadata data frame - variableMetadata
sampleNames(eset) sample names
featureNames(eset) variable names
dims(eset) 2-element numeric vector of ‘Features’ and ‘Samples’ dimensions
varLabels(eset) Column names of the sampleMetadata, pData(eset)
fvarLabels(eset) Column names of the variableMetadata, fData(eset)

1.3.3 MultiDataSet class for multiple datasets

All metabolis methods (including metRead and metWrite) can also be applied to multiple omics datasets: in this case, the MultiDataSet class is used instead of ExpressionSet (Hernandez-Ferrer et al. 2016).

1.4 Text and graphical outputs

Text and graphics can be handled with the metabolis methods by setting the two arguments:

  • info.txtC: if set to NA [default], messages are displayed interactively; if set to a character corresponding the a filename (with the ‘.txt’ extension), messages are diverted to this file by using the sink function internally (the same file can be appended by successive methods); if set to NULL, messages are suppressed

  • fig.pdfC: if set to NA [default], graphics are displayed interactively; if set to a character corresponding the a filename (with the ‘.pdf’ extension), a pdf file with the plot is generated instead; if set to NULL, graphics are suppressed

1.5 Availability

The metabolis package can be installed from github:

install.packages("devtools", dep=TRUE)  
devtools::install_github("https://github.com/ethevenot/metabolis")

1.6 Acknowledgements

This package was developed within the Metabolomics Data Sciences and Integration team at CEA, including Natacha Lenuzza, Pierrick Roger, Philippe Rinaudo, Alexis Delabriere, Camille Roquencourt, Alyssa Imbert.

Interactions with experiments from the Drug Metabolism Research Laboratory were critical to develop optimal methods for quality control and normalization, including Aurelie Roux, Samia Boudah, Florence Castelli, Benoit Colsch, Christophe Junot, Francois Fenaille.

Discussions with bioinformaticians and biostatisticians from the MetaboHUB infrastructure for metabolomics and fluxomics, and the Workflow4metabolomics Galaxy project were also of high value, including Marie Tremblay-Franco, Jean-Francois Martin, Melanie Petera, Yann Guitton, Gildas Le Corguille, Christophe Caron, Franck Giacomoni, Fabien Jourdan, Dominique Rolin.

2 Hands-on

2.1 The sacurine cohort study

As an example, we will use the metabolis package to study the sacurine human cohort. The study is aimed at characterizing the physiological variations of the human urine metabolome with age, body mass index (BMI), and gender (E. A. Thévenot et al. 2015). Urine samples from 184 volunteers were analyzed by reversed-phase (C18) ultrahigh performance liquid chromatography (UPLC) coupled to high-resolution mass spectrometry (LTQ-Orbitrap). Raw data are publicly available on the MetaboLights repository (MTBLS404).

This vignette describes the statistical analysis of the data set from the negative ionization mode (113 identified metabolites at MSI levels 1 or 2):

  • metCorrect: Correction of the signal drift by local regression (loess) modeling of the intensity trend in pool samples (Dunn et al. 2011); Adjustment of offset differences between the two analytical batches by using the average of the pool intensities in each batch (Kloet et al. 2009)

  • Variable quality control by discarding features with a coefficient of variation above 30% in pool samples

  • Normalization by the sample osmolality

  • metTransform: log10 transformation

  • metView: Computing metrics to filter out outlier samples according to the Weighted Hotellings’T2 distance (Tenenhaus 1999), the Z-score of one of the intensity distribution deciles (Alonso et al. 2011), and the Z-score of the number of missing values (Alonso et al. 2011). A 0.001 threshold for all p-values results in the HU_096 sample being discarded

  • metTest: Univariate hypothesis testing of significant variations with age, BMI, or between genders (Student’s T test with Benjamini Hochberg correction)

  • PCA exploration of the data set; ropls Bioconductor package (E. A. Thévenot et al. 2015)

  • metHeatmap: Hierarchical clustering

  • OPLS(-DA) modeling of age, BMI and gender; ropls Bioconductor package (E. A. Thévenot et al. 2015)

  • Selection of the features which significantly contributes to the discrimination between gender with PLS-DA, Random Forest, or Support Vector Machines classifiers; biosigner Bioconductor package (Rinaudo et al. 2016)

A Galaxy version of this analysis is available W4M00001 ‘Sacurine-statistics’ on the Workflow4metabolomics.org online infrastructure (Guitton et al. 2017)

We start by loading the metabolis package

suppressMessages(library(metabolis))

2.2 metRead: Reading the data

The metRead function reads the data sets and builds the ExpressionSet object. For additional information about ExpressionSet class, see the “An introduction to Biobase and ExpressionSets” documentation from the Biobase package.

sacSet <- metabolis::metRead(system.file("extdata/sacurine",
                                         package = "metabolis"))
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 113 features, 210 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: QC1_001 HU_neg_017 ... QC1_012_b2 (210 total)
##   varLabels: subset full ... gender (10 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: (2-methoxyethoxy)propanoic acid isomer
##     (gamma)Glu-Leu/Ile ... Xanthosine (113 total)
##   fvarLabels: database_identifier chemical_formula ... reliability
##     (10 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

2.3 metView: Looking at the data

sacSet <- metabolis::metView(sacSet)
## 
## 
## Data description:
## 
## observations: 210 
## variables: 113 
## missing: 0 
## 0 values (%): 0 
## min: 506.5 
## mean: 6700000 
## median: 670000 
## max: 6.8e+08 
## 
## Sample types:
## 
##   pool sample 
##     26    184

2.4 Post-processing

2.4.1 metCorrect: Correcting signal drift and batch effect

sacSet <- metabolis::metCorrect(sacSet,
                                referenceSampleTypeC = "pool",
                                colnameBatchC = "batch",
                                colnameInjectionOrderC = "injectionOrder",
                                colnameSampleTypeC = "sampleType")
## 
## Reference observations are:  pool 
## 
## Processing batch:
##  ne1
##  ne2

2.4.2 Variable filtering

  • using metView to compute the ‘pool_CV’ metric
sacSet <- metabolis::metView(sacSet)
## 
## 
## Data description:
## 
## observations: 210 
## variables: 113 
## missing: 0 
## 0 values (%): 0 
## min: 268.1565 
## mean: 6500000 
## median: 710000 
## max: 5.1e+08 
## 
## Sample types:
## 
##   pool sample 
##     26    184

  • discarding features with ‘pool_CV’ > 0.3
sacSet <- sacSet[Biobase::fData(sacSet)[, "pool_CV"] <= 0.3, ]
  • discarding the ‘pool’ observations
sacSet <- sacSet[, Biobase::pData(sacSet)[, "sampleType"] != "pool"]
print(sacSet)
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 110 features, 184 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: HU_neg_017 HU_neg_018 ... HU_neg_146_b2 (184 total)
##   varLabels: subset full ... deci_pval (15 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: (2-methoxyethoxy)propanoic acid isomer
##     (gamma)Glu-Leu/Ile ... Xanthosine (110 total)
##   fvarLabels: database_identifier chemical_formula ... pool_CV (11
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

2.4.3 Normalizing

Biobase::exprs(sacSet) <- sweep(Biobase::exprs(sacSet),
                                2,
                                Biobase::pData(sacSet)[, "osmolality"],
                                "/")

2.4.4 metTransform: Transforming the data intensities

sacSet <- metabolis::metTransform(sacSet,
                                  methodC = "log10")
## 
## Missing values in the 'dataMatrix': 0 (0%)
## 
## Zero values in the 'dataMatrix': 0 (0%)
## 
## 'log10' transformation

2.4.5 Sample filtering

sacSet <- metabolis::metView(sacSet)
## 
## 
## Data description:
## 
## observations: 184 
## variables: 110 
## missing: 0 
## 0 values (%): 0 
## min: 0.1213445 
## mean: 3 
## median: 3 
## max: 5.9 
## 
## Sample types:
## 
## sample 
##    184

sacSet <- sacSet[, Biobase::pData(sacSet)[, "hotel_pval"] >= 0.001 &
                   Biobase::pData(sacSet)[, "miss_pval"] >= 0.001 &
                   Biobase::pData(sacSet)[, "deci_pval"] >= 0.001]

Final visual check of the data before performing the statistics

metabolis::metView(sacSet)
## 
## 
## Data description:
## 
## observations: 182 
## variables: 110 
## missing: 0 
## 0 values (%): 0 
## min: 0.1391968 
## mean: 3.1 
## median: 3 
## max: 5.9 
## 
## Sample types:
## 
## sample 
##    182

2.5 metTest: Univariate hypothesis testing

sacSet <- metabolis::metTest(sacSet,
                             factor = "gender",
                             testC = "ttest",
                             adjustC = "BH",
                             adjustThreshN = 0.05)
## 
## Performing 'ttest'

## 
## 40 variables (36%) were found significant at the 0.05 level (after the 'BH' correction).
## The first 15 are displayed below (sorted by increasing corrected p-values):
##                                   gender_ttest_Female.Male_dif
## p-Anisic acid                                       -1.0003177
## Testosterone glucuronide                             0.5815702
## Malic acid                                          -0.2185078
## Pantothenic acid                                    -0.2158701
## Acetylphenylalanine                                 -0.2602737
## p-Hydroxyhippuric acid                              -0.1561200
## Citric acid                                         -0.1078489
## alpha-N-Phenylacetyl-glutamine                      -0.1120281
## Oxoglutaric acid                                    -0.2730331
## 4-Acetamidobutanoic acid isomer 3                   -0.1762354
## N-Acetyl-aspartic acid                              -0.1038041
## Gluconic acid and/or isomers                        -0.1319615
## Monoethyl phthalate                                 -0.3928070
## Glyceric acid                                       -0.1312902
## Glucuronic acid and/or isomers                      -0.1196951
##                                   gender_ttest_Female.Male_BH
## p-Anisic acid                                    5.878932e-11
## Testosterone glucuronide                         2.171066e-10
## Malic acid                                       9.323122e-09
## Pantothenic acid                                 1.244134e-07
## Acetylphenylalanine                              4.026579e-06
## p-Hydroxyhippuric acid                           1.220371e-05
## Citric acid                                      3.071374e-05
## alpha-N-Phenylacetyl-glutamine                   7.754774e-05
## Oxoglutaric acid                                 1.633893e-04
## 4-Acetamidobutanoic acid isomer 3                2.162307e-04
## N-Acetyl-aspartic acid                           2.162307e-04
## Gluconic acid and/or isomers                     4.071573e-04
## Monoethyl phthalate                              1.280687e-03
## Glyceric acid                                    1.722232e-03
## Glucuronic acid and/or isomers                   1.962875e-03

2.6 Unsupervised analysis

2.6.1 PCA modeling

ropls Bioconductor package

(E. A. Thévenot et al. 2015)

suppressMessages(library(ropls))
sacPca <- ropls::opls(sacSet)
## PCA
## 182 samples x 110 variables
## standard scaling of predictors
##       R2X(cum) pre ort
## Total    0.526   9   0

plot(sacPca,
     parAsColFcVn = Biobase::pData(sacSet)[, "gender"],
                                   typeVc = "x-score")
## Warning: Character 'parAsColFcVn' set to a factor

plot(sacPca,
     parAsColFcVn = Biobase::pData(sacSet)[, "age"],
                                   typeVc = "x-score")

plot(sacPca,
     parAsColFcVn = Biobase::pData(sacSet)[, "bmi"],
                                   typeVc = "x-score")

sacSet <- ropls::getEset(sacPca)

2.6.2 metHeatmap: hierarchical clustering

sacSet <- metabolis::metHeatmap(sacSet, correlC = "spearman",
                                clustVi = c(5, 3))

2.7 Supervised modeling

2.7.1 PLS modeling

ropls Bioconductor package

(E. A. Thévenot et al. 2015)

sacPlsda <- ropls::opls(sacSet, "gender")
## PLS-DA
## 182 samples x 110 variables and 1 response
## standard scaling of predictors and response(s)
##       R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y  pQ2
## Total    0.278    0.731   0.562 0.261   3   0 0.05 0.05

sacSet <- ropls::getEset(sacPlsda)

2.7.2 Feature selection

biosigner Bioconductor package

(Rinaudo et al. 2016)

suppressMessages(library(biosigner))
set.seed(123)
sacSign <- biosigner::biosign(sacSet, "gender")
## Warning: 'y' character vector converted to a factor with levels 'Female',
## 'Male'
## Significant features from 'S' groups:
##                          plsda randomforest svm
## Testosterone glucuronide "S"   "S"          "S"
## p-Anisic acid            "S"   "A"          "S"
## Oxoglutaric acid         "C"   "S"          "S"
## Pantothenic acid         "A"   "B"          "S"
## Acetylphenylalanine      "B"   "B"          "S"
## Ketoleucine              "E"   "E"          "S"
## Malic acid               "S"   "E"          "E"
## Taurine                  "E"   "E"          "S"
## Accuracy:
##      plsda randomforest   svm
## Full 0.871        0.861 0.891
## AS   0.867        0.869 0.911
## S    0.884        0.860 0.906

set.seed(NULL)
sacSet <- biosigner::getEset(sacSign)

2.8 metWrite: Exporting the results

metabolis::metWrite(sacSet, dirC = getwd())

References

Alonso, Arnald, Antonio Julia, Antoni Beltran, Maria Vinaixa, Marta Diaz, Lourdes Ibanez, Xavier Correig, and Sara Marsal. 2011. “AStream: An R Package for Annotating LC/MS Metabolomic Data.” Bioinformatics 27 (9): 1339–40. doi:10.1093/bioinformatics/btr138.

Dunn, Warwick B, David Broadhurst, Paul Begley, Eva Zelena, Sue Francis-McIntyre, Nadine Anderson, Marie Brown, et al. 2011. “Procedures for Large-Scale Metabolic Profiling of Serum and Plasma Using Gas Chromatography and Liquid Chromatography Coupled to Mass Spectrometry.” Nature Protocols 6 (7): 1060–83. doi:10.1038/nprot.2011.335.

Guitton, Yann, Marie Tremblay-Franco, Gildas Le Corguillé G., Jean-Francois Martin, Melanie Pétéra, Pierrick Roger-Mele, Alexis Delabrière, et al. 2017. “Create, Run, Share, Publish, and Reference Your LC-MS, FIA-MS, GC-MS, and NMR Data Analysis Workflows with the Workflow4Metabolomics 3.0 Galaxy Online Infrastructure for Metabolomics.” The International Journal of Biochemistry & Cell Biology 93: 89–101. doi:10.1016/j.biocel.2017.07.002.

Hernandez-Ferrer, Carles, Carlos Ruiz-Arenas, Alba Beltran-Gomila, and Juan R. Gonzalez. 2016. “MultiDataSet: An R Package for Encapsulating Multiple Data Sets with Application to Omic Data Integration.” BMC Bioinformatics 18 (PMC5240259): 36. doi:10.1186/s12859-016-1455-1.

Kloet, Frans M. van der, Ivana Bobeldijk, Elwin R. Verheij, and Renger H. Jellema. 2009. “Analytical Error Reduction Using Single Point Calibration for Accurate and Precise Metabolomic Phenotyping.” Journal of Proteome Research 8 (11): 5132–41. doi:10.1021/pr900499r.

Rinaudo, Philippe, Samia Boudah, Christophe Junot, and Etienne A Thévenot. 2016. “Biosigner: A New Method for the Discovery of Significant Molecular Signatures from Omics Data.” Frontiers in Molecular Biosciences 3: –. doi:10.3389/fmolb.2016.00026.

Tenenhaus, M. 1999. “L’approche PLS.” Revue de Statistiques Appliquees 47: 5–40.

Thévenot, Etienne A., Aurélie Roux, Ying Xu, Eric Ezan, and Christophe Junot. 2015. “Analysis of the Human Adult Urinary Metabolome Variations with Age, Body Mass Index and Gender by Implementing a Comprehensive Workflow for Univariate and OPLS Statistical Analyses.” Journal of Proteome Research 14 (8): 3322–35. doi:10.1021/acs.jproteome.5b00354.